Ida Sofie Adrian (s243903), Mariam Dalia (s242707), Mathilde Melgaard (s243633), Nina Zomorrodnia (s243923), Victor Hansen (s243634)
Introduction & Aim
Introduction
Cervical cancer is a significant cause of mortality in low-income countries.
The Data:
Medical records from 858 female patients.
Random sampling of patients between the years 2012 and 2013.
Gynecology service at Hospital Universitario de Caracas in Caracas, Venezuela.
Aim
Find correlations between variables and cancer diagnosis
Find correlations between different interesting variables
Chi-squared test: To evaluate whether there is a significant association between two variables.
PCA: To identify patterns and relationships in data.
Visualization: To present the data using various plots.
Methods: Data cleaning
Changing ? to NAs
Changing values 1.0 and 0.0 to yes and no
Changing columns to numeric
Renaming columns for consistency
# Changing values data_raw <- data_raw |>mutate(Smokes =case_when( Smokes =="0.0"~"no", Smokes =="1.0"~"yes"),Hormonal.Contraceptives =case_when( Hormonal.Contraceptives =="1.0"~"yes", Hormonal.Contraceptives =="0.0"~"no"),IUD =case_when( IUD =="0.0"~"no", IUD =="1.0"~"yes"),Dx.Cancer =case_when( Dx.Cancer =="0"~"no", Dx.Cancer =="1"~"yes"))# Tidyingdata_clean <- data_raw |>rename('Smokes.years'= Smokes..years.,'Smokes.packs.years'= Smokes..packs.year.) |>rename_with(~gsub("^STDs\\.\\.", "", .)) |>rename_with(~str_remove(.,"\\.$")) |>#removes the '.' from the last word in columnsrename_with(~str_replace_all(., "\\.", "_")) |>rename_with(~str_replace_all(.x, "__", "_"))
Methods: Data augment
Creating patient IDs
Pivot_longer to create new STD column
Tidying names in rows
Creating new count column for STD
Changing the order of columns
# Changing data to longdata_long <- data_clean |>pivot_longer(cols =starts_with("STDs_"), names_to ="STD_type", values_to ="has_STD") |>mutate(STD =ifelse(has_STD ==1, STD_type, NA) # Keep STD name where 1 is present) |>group_by(ID) |>mutate(# Concatenate STD names for each ID, if none, set "No"STD =ifelse(all(is.na(STD)), "No", paste(na.omit(STD), collapse =", "))) |>ungroup() |>select(-STD_type, -has_STD) |>distinct() |>separate_rows(STD, sep =",")# Creating new columndata_long <- data_long |>group_by(ID) |>mutate(Number_of_STDs =if_else(all(is.na(STD) | STD =="No"), 0, n_distinct(STD, na.rm =TRUE))) |>ungroup())
Results: Barplot
Few women have STDs, regardless of cervical cancer diagnosis status
Women with cancer: HPV seems to be the only STD present.
Women without cancer: A few patients with different types of STDs, but no cases with HPV specifically.
Results: Boxplot
Hypothesis: Women with first sexual intercourse at a young age tend to get STDs and later cervical cancer.
No correlation between number of sexual partners and cervical cancer diagnosis status.
Small correlation between the age of first sexual intercourse and cervical cancer diagnosis status:
Women with cervical cancer were older when they had first sexual intercourse compared to women without cancer.
Hypothesis does not seem to be true.
Results: Visualization Function
create_proportional_barplot <-function(data, x_var, fill_var, x_label =NULL, y_label ="Proportion", fill_label =NULL, title =NULL) {if (is.null(x_label)) x_label <- x_varif (is.null(fill_label)) fill_label <- fill_varif (is.null(title)) title <-paste("Proportion of", fill_var, "by", x_var) filtered_data <- data %>%group_by(ID) %>%summarise(x_value =first(!!sym(x_var)),fill_value =first(!!sym(fill_var)),.groups ="drop" ) %>%filter(!is.na(x_value), !is.na(fill_value)) %>%mutate(fill_value =factor(fill_value))# Ensure color palette matches the levels of 'fill_value' color_palette <-setNames(c("lightblue", "darkred", "green", "orange")[1:length(levels(filtered_data$fill_value))],levels(filtered_data$fill_value))ggplot(filtered_data, aes(x = x_value, fill = fill_value)) +geom_bar(position ="fill") +labs(x = x_label, y = y_label, fill = fill_label, title = title) +scale_fill_manual(values = color_palette) +theme_minimal() +theme(axis.text.x =element_text(angle =0, hjust =0.5))}proportion_plot <-create_proportional_barplot(data_aug, x_var ="Dx_Cancer",fill_var ="Smokes")ggsave("../results/images/05_proportion_plot.png", plot = proportion_plot)
Function for Correlation Visualization of Categorical Variables
Input: dataset and two variables
Output: stratified bar plot
Compare categorical variables such as: IUD, Hormonal Contraceptives, Smoking, Cancer diagnosis
Results: Function for correlation
Function for contingency matrix and correlation score
Input: Data set and two variables
Output: Contingency table and chi-2
calculate_chi_squared <-function(data, var1, var2) {# Summarize the data so each ID has one row, taking the first occurrence of var1 and var2 filtered_data <- data %>%group_by(ID) %>%summarise(var1_value =first(!!sym(var1)), var2_value =first(!!sym(var2)),.groups ="drop") %>%filter(!is.na(var1_value), !is.na(var2_value)) # Remove rows with NA values# Create the contingency table contingency_table <- filtered_data %>%count(var1_value, var2_value) %>%pivot_wider(names_from = var2_value, values_from = n, values_fill =0) %>%column_to_rownames("var1_value") %>%as.matrix()# Perform the chi-squared test chisq_result <-chisq.test(contingency_table)return(list(contingency_table = contingency_table, chisq_result = chisq_result))}calculate_chi_squared(data_aug, var1 ="Dx_Cancer", var2 ="Dx_HPV")
Arrows in similar directions = positive correlation
Arrows in opposite directions = negative correlation
Arrows near the origin = minimal contribution to the components
The length of an arrow = amount of contribution to the components
Extract Eigenvalues of the PCA
Represents the amount of variance explained by each principal component
PC4 and PC5 explains 50-60% of the variance
Conclusions
Potential correlation relationships between:
HPV and cancer diagnosis: Cases of HPV diagnosis correlate with cases of cervical cancer diagnosis
Age at first sexual intercourse and cancer diagnosis: Higher age at event of first sexual intercourse correlated better with positive cervical cancer diagnosis
Age and cancer diagnosis: Older ages seen in women diagnosed with cervical cancer
Use of IUD and cancer diagnosis: Use of IUD showed potential correlation with cancer diagnosis